The kernel trick in SVMs allows us to use higher-dimensional mappings \(\varphi(X)\) without ever computing them explicitly. Provide the mapping \(\varphi()\) for vectors \(X\) and \(Y\) such that the dot product \(\varphi(X)\cdot\varphi(Y)\) can be calculated as \(\langle X \cdot Y\rangle^{3}\). For simplicity, assume a 2-dimensional space.
Solution:
A kernel corresponds to a mapping function \(\varphi\) — one that transforms a given space into some other (usually much higher dimensional) space. While solving the SVM, we only ever need inner products of vectors, never their coordinates in the transformed space. Say we choose a kernel \(K\), and \(X\) and \(Y\) are two points in the original space; the associated mapping sends them to \(\varphi(X)\) and \(\varphi(Y)\), and the kernel computes their inner product directly: \(K(X,Y) = \varphi(X)\cdot\varphi(Y)\). Kernels are also known as similarity functions, and some of the popular kernels are the polynomial, radial basis function, and sigmoid kernels. In this task we are projecting 2-dimensional data into a 4-dimensional space.
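Concretely, expanding the cube for 2-dimensional \(X=(x_1,x_2)\) and \(Y=(y_1,y_2)\) gives

\[ \langle X \cdot Y\rangle^{3} = (x_1 y_1 + x_2 y_2)^3 = x_1^3 y_1^3 + 3x_1^2 x_2\, y_1^2 y_2 + 3x_1 x_2^2\, y_1 y_2^2 + x_2^3 y_2^3, \]

so the 4-dimensional mapping

\[ \varphi(x) = \left(x_1^3,\; \sqrt{3}\,x_1^2 x_2,\; \sqrt{3}\,x_1 x_2^2,\; x_2^3\right) \]

satisfies \(\varphi(X)\cdot\varphi(Y) = \langle X \cdot Y\rangle^{3}\), term by term.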
Read the article by Domingos, "A Few Useful Things to Know about Machine Learning". Make a list of key messages with a supporting 1-2 sentence example or clarification of each message (something like a short summary of the article).
Solution:
The paper presents 12 key messages that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks.
1. Learning = Representation + Evaluation + Optimization
When faced with a machine learning problem, don’t get lost in the hundreds of possible algorithms you could be using. Instead, focus on three key components: the representation (the hypothesis space of models the learner can express), the evaluation (the objective function that scores candidate models), and the optimization (the method used to search among them).
The selection of these three components defines the set of possible models that can be learned, what the “best” model looks like, and how to search for that model.
2. It’s Generalization That Counts
The fundamental goal of machine learning is to generalize beyond the training set. So the key message in this part is that while building a model it is important to keep separate test and training data, because a model can learn the training data exceptionally well but perform miserably on new data. This is why cross-validation is so important: without evaluating a model on data not seen during training, there is no way to tell how good it actually is.
3. Data Alone Is Not Enough
While creating a model we must make assumptions beyond the data it is given in order to generalize beyond it. Simple assumptions (such as smoothness, or that similar examples have similar classes) take us a long way. The author also argues that induction is a far more powerful lever than deduction: it requires only a small amount of input knowledge to produce a large amount of output knowledge, and we must use that input knowledge effectively.
4. Overfitting Has Many Faces
We can recognize overfitting when accuracy is high on the training data and low on the testing data. One way to interpret overfitting is to break down generalization error into two components: bias and variance. Bias is the tendency of the learner to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the signal. In machine learning, strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting (this explains why naive Bayes sometimes beats C4.5).
Some tools against overfitting are cross-validation, adding a regularization term to the evaluation function, and performing a statistical significance test (such as chi-square) before adding new structure to the model.
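Cross-validation, the first of these tools, can be sketched in a few lines of base R (a hypothetical helper, not from the article; the trivial majority-class learner stands in for any real one):

```r
# Minimal sketch of k-fold cross-validation (base R, hypothetical helper).
# The "learner" here is just a majority-class predictor; any real model
# (svm, randomForest, ...) slots into the same train/test loop.
cv.error <- function(y, k = 5) {
    folds <- rep(1:k, length.out = length(y))  # deterministic fold labels
    errs <- sapply(1:k, function(i) {
        majority <- names(which.max(table(y[folds != i])))  # "train" step
        mean(y[folds == i] != majority)                     # held-out error
    })
    mean(errs)
}
cv.error(rep(c("in", "out"), each = 5))  # 0.5: the majority guess is wrong half the time
```

The point is that every error estimate comes from data the "model" never saw during its fold's training step.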
5. Intuition Fails In High Dimensions
As the number of features in a data set increases, it becomes exponentially more difficult to train a model that generalizes well. In addition, similarity measures fail to discriminate effectively between examples in high dimensions, because at very high dimensions everything looks alike. Our intuitions from three-dimensional space often do not apply in higher-dimensional spaces. As a result, we need to be careful when adding features, because the cost of higher dimensionality may outweigh the benefit of the extra information.
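The "everything looks alike" effect can be checked with a short simulation (an assumed illustration, not from the article): the relative contrast between the largest and smallest pairwise distance collapses as dimensionality grows.

```r
# Sketch of distance concentration in high dimensions: relative contrast
# (max - min) / min of pairwise distances among random uniform points.
set.seed(1)
dist.contrast <- function(d, n = 100) {
    pts <- matrix(runif(n * d), nrow = n, ncol = d)  # n uniform points in [0,1]^d
    dists <- dist(pts)                               # all pairwise distances
    (max(dists) - min(dists)) / min(dists)
}
dist.contrast(2)     # large: near and far neighbours are clearly distinguishable
dist.contrast(1000)  # small: almost all points are nearly equidistant
```

With nearly equidistant points, nearest-neighbour style similarity reasoning stops being informative.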
6. Theoretical Guarantees Are Not What They Seem
The author gives the example of a probabilistic bound: with high probability, a classifier consistent with a large enough training set is approximately correct. Although such guarantees are theoretically valid, the bounds are usually so loose that they say little in practice. Another type of theoretical guarantee the author mentions is asymptotic. However, learning is a complex phenomenon, and theoretical guarantees cannot always justify it: even if a theoretical justification is provided and the model works in practice, that does not mean the former explains the latter.
7. Feature Engineering Is The Key
The choice of features is the most important factor in whether the algorithm will be successful or not. As a result, a major chunk of the time in a machine learning project is spent on feature design. Feature engineering usually requires domain knowledge specific to the problem at hand, which makes it very difficult to automate. One approach is to generate a large number of features and select those that best correlate with the class. This can work well, but a trap is that features which look irrelevant in isolation may be useful in combination, and such non-linear interactions with the output variable are easy to miss.
8. More Data Beats A Clever Algorithm
When we try to overcome a model performance bottleneck, we have two options: design a more sophisticated learning algorithm, or simply gather more data.
As a rule of thumb, a simple algorithm with lots of data beats a clever algorithm with a modest amount of data. But more data means more scalability issues. Therefore, the true challenge is to design a classifier which learns from large data in a small amount of time.
9. Learn Many Models, Not Just One
Instead of picking one algorithm and spending a lot of time optimizing it, it is better to try many different algorithms and ensemble them together. The three most popular ensemble methods are bagging (training models on bootstrap resamples and voting), boosting (reweighting examples so each new model focuses on the previous models' mistakes), and stacking (using the models' outputs as inputs to a higher-level learner).
10. Simplicity Does Not Imply Accuracy
In machine learning it is often assumed that, given two classifiers with the same training error, the simpler of the two will likely have the lower test error. Though Occam’s razor suggests that machine learning models should be kept simple, there is no necessary connection between the number of parameters of a model and its tendency to overfit.
A more interesting perspective is to consider the complexity of a model in terms of the size of the hypothesis space it is drawn from. A larger space is typically searched less exhaustively, and a classifier found after trying fewer hypotheses is less likely to have overfit the training data.
11. Representable Does Not Imply Learnable
The main point in this part is that we need to ask whether the target function can be learned, not merely whether it can be represented. Just because a function can be represented does not mean it can actually be learnt: restrictions imposed by data, time and memory limit which functions can be learned in a feasible manner. For example, decision tree learners cannot learn trees with more leaves than there are training examples. Finding methods to learn deeper representations is a major research area in machine learning.
12. Correlation Does Not Imply Causation
Correlation may hint at a possible cause-and-effect relationship, but that needs to be further investigated and validated (for example, if you can obtain experimental data, then by all means do so). In other words, correlation cannot be taken as proof of causation.
Let’s define a circle on 2-dimensional space, with radius 3 and center in (5;5). Provide a function that defines whether a point on a plane is within the circle or outside. Provide the linear classifier formulation for this; and the scoring defined as distance to that separating hyperplane (for how strong is the prediction). Calculate that score for 5 different points - inside the circle and outside; near the hyperplane and far. E.g. use the points (2.8;2.8), (3;3), (6;6), (4;8), (9;9).
Solution:
We can check if a point \((x_i,y_i)\) is inside a circle with center \((x_c,y_c)\) using the following formula:
\[ d^2 = (x_i - x_c)^2 + (y_i - y_c)^2 \]
The point \((x_i,y_i)\) is inside the circle if \(d^2 \le r^2\), where \(r\) is the radius, and outside if \(d^2 > r^2\).
I am assuming that the points that lie on the border of the circle are inside the circle.
inside.circle = function(center, radius, dots) {
# Logical index: TRUE for points inside or on the circle
inside = (dots[, 1] - center[1])^2 + (dots[, 2] - center[2])^2 <= radius^2
x.coord = dots[inside, 1]
y.coord = dots[inside, 2]
coordinates = cbind(x.coord, y.coord)
# Parametric form of the circle, for plotting its outline
theta = seq(0, 2 * pi, length = 2000)
x = center[1] + (radius * cos(theta))
y = center[2] + (radius * sin(theta))
# Plot circle and add points
plot(x, y, type = "l", xlim = c(0, max(dots[, 1])), ylim = c(0, max(dots[,
2])), main = "Dots classification", ylab = "Y", xlab = "X")
points(dots[, 1], dots[, 2], col = "blue")
# The points inside the circle are red.
points(x.coord, y.coord, col = "red")
return(coordinates)
}
The points inside the circle are labeled with red. Those are the points (3,3) and (6,6).
points <- data.frame(x = c(2.8, 3, 6, 4, 9), y = c(2.8, 3, 6, 8, 9))
inside.circle(c(5, 5), 3, points)
## x.coord y.coord
## [1,] 3 3
## [2,] 6 6
## [1] "Scores and classification for all points (0 outside, 1 inside)"
## x y score class
## 1 2.8 2.8 9.68 0
## 2 3.0 3.0 8.00 1
## 3 6.0 6.0 2.00 1
## 4 4.0 8.0 10.00 0
## 5 9.0 9.0 32.00 0
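The linear-classifier formulation the task asks for can be made explicit. Mapping a point \(p=(x,y)\) to the features \((x^2, y^2, x, y)\) turns the circle test into a linear inequality, and a signed variant of the score in the table above (the table reports \(d^2\) and thresholds at \(r^2 = 9\); subtracting 9 makes the boundary the zero level of a linear function) measures how strong the prediction is. A small sketch:

```r
# Circle membership as a linear classifier in the feature space
# phi(p) = (x^2, y^2, x, y): expanding (x-5)^2 + (y-5)^2 - 9 gives
# weights w = (1, 1, -10, -10) and bias b = 25 + 25 - 9 = 41.
# The signed score w . phi(p) + b equals d^2 - r^2:
# negative inside the circle, positive outside, magnitude = strength.
circle.score <- function(p, center = c(5, 5), radius = 3) {
    w <- c(1, 1, -2 * center[1], -2 * center[2])
    b <- sum(center^2) - radius^2
    phi <- c(p[1]^2, p[2]^2, p[1], p[2])
    sum(w * phi) + b
}
circle.score(c(3, 3))  # -1: inside, close to the boundary
circle.score(c(6, 6))  # -7: inside
circle.score(c(9, 9))  # 23: outside, far away
```

This is a hyperplane in the 4-dimensional feature space even though the boundary is a circle in the original plane.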
Generate randomly 20, 50, 100 points in the range (0..10,0..10) and label them as positive (within the circle) or negative (outside the circle). Learn 4 different classifier types (e.g. decision tree, random forest, SVM with 2 different kernels), in each case. Label the full area (all points) in (0..10, 0..10) with 0.5 step using the learned classifier (for example if for point (2, 1.5) your classifier predicts positive, you label that area positive). Visualize the “area” learned by classifier with two different colors.
Solution:
We can generate our data using the runif function, and we can label whether each point is inside or outside a circle with center (5,5) and radius 3 using the code from the previous exercise.
# Generating 20 random points; the same is done for 50 and 100.
set.seed(1)
x1 <- runif(20, min = 0, max = 10)
y1 <- runif(20, min = 0, max = 10)
data_20 <- data.frame(x1, y1)
## [1] "Points classification, (1 inside, 0 out)"
## x1 y1 class
## 1 2.6550866 9.3470523 0
## 2 3.7212390 2.1214252 0
## 3 5.7285336 6.5167377 1
## 4 9.0820779 1.2555510 0
## 5 2.0168193 2.6722067 0
## 6 8.9838968 3.8611409 0
## 7 9.4467527 0.1339033 0
## 8 6.6079779 3.8238796 1
## 9 6.2911404 8.6969085 0
## 10 0.6178627 3.4034900 0
## 11 2.0597457 4.8208012 1
## 12 1.7655675 5.9956583 0
## 13 6.8702285 4.9354131 1
## 14 3.8410372 1.8621760 0
## 15 7.6984142 8.2737332 0
## 16 4.9769924 6.6846674 1
## 17 7.1761851 7.9423986 0
## 18 9.9190609 1.0794363 0
## 19 3.8003518 7.2371095 1
## 20 7.7744522 4.1127443 1
data_20$class <- as.factor(data_20$class)
# svm() from the e1071 package
svm20_r <- svm(class ~ ., data = data_20, kernel = "radial", cost = 10, scale = FALSE)
# plot.svm
plot(svm20_r, data_20)
# Scatterplot
plot(data_20$x1, data_20$y1, pch = 21, bg = c("red", "green3")[unclass(data_20$class)],
main = "SVM_20 Radial Kernel", xlab = "x", ylab = "y")
data_50$class <- as.factor(data_50$class)
svm50_r <- svm(class ~ ., data = data_50, kernel = "radial", cost = 10, scale = FALSE)
plot(svm50_r, data_50)
plot(data_50$x2, data_50$y2, pch = 21, bg = c("red", "green3")[unclass(data_50$class)],
main = "SVM_50 Radial Kernel", xlab = "x", ylab = "y")
data_100$class <- as.factor(data_100$class)
svm100_r <- svm(class ~ ., data = data_100, kernel = "radial", cost = 10, scale = FALSE)
plot(svm100_r, data_100)
plot(data_100$x3, data_100$y3, pch = 21, bg = c("red", "green3")[unclass(data_100$class)],
main = "SVM_100 Radial Kernel", xlab = "x", ylab = "y")
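The task asks to label the whole area on a 0.5 step. A minimal base-R sketch of that grid follows; here it is labeled with the true circle for reference, and with a trained model one would use predict(model, grid) instead to color the learned area:

```r
# 0.5-step grid over the (0..10, 0..10) area, labeled here with the true
# circle (center (5,5), radius 3). With a fitted classifier, replace the
# `inside` column by predict(model, grid).
grid <- expand.grid(x = seq(0, 10, by = 0.5), y = seq(0, 10, by = 0.5))
grid$inside <- (grid$x - 5)^2 + (grid$y - 5)^2 <= 3^2
nrow(grid)        # 441 grid cells (21 x 21)
sum(grid$inside)  # 113 cells fall inside the circle
plot(grid$x, grid$y, pch = 15, col = ifelse(grid$inside, "green3", "red"),
    main = "True area on the 0.5 grid", xlab = "x", ylab = "y")
```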
# Plot decision boundaries
myplot <- function(model, data, class = NULL, predict_type = "class", resolution = 100,
showgrid = TRUE, ...) {
if (!is.null(class))
cl <- data[, class] else cl <- 1
data <- data[, 1:2]
plot(data, col = as.integer(cl) + 1L, pch = as.integer(cl) + 1L, ...)
# make grid
r <- sapply(data, range, na.rm = TRUE)
xs <- seq(r[1, 1], r[2, 1], length.out = resolution)
ys <- seq(r[1, 2], r[2, 2], length.out = resolution)
g <- cbind(rep(xs, each = resolution), rep(ys, times = resolution))
colnames(g) <- colnames(r)
g <- as.data.frame(g)
p <- predict(model, g, type = predict_type)
if (is.list(p))
p <- p$class
p <- as.factor(p)
if (showgrid)
points(g, col = as.integer(p) + 1L, pch = ".")
z <- matrix(as.integer(p), nrow = resolution, byrow = TRUE)
contour(xs, ys, z, add = TRUE, drawlabels = FALSE, lwd = 1)
invisible(z)
}
For the SVM I plot the decision boundary in three different ways: the built-in plot.svm, a plain scatterplot, and the grid-based myplot function above.
I am using SVM with radial and polynomial kernels. The decision boundary is represented in the same way for the random forest and decision tree algorithms.
svm20_l <- svm(class ~ ., data = data_20, kernel = "polynomial", scale = FALSE)
plot(svm20_l, data_20)
svm50_l <- svm(class ~ ., data = data_50, kernel = "polynomial", scale = FALSE)
plot(svm50_l, data_50)
svm100_l <- svm(class ~ ., data = data_100, kernel = "polynomial", scale = FALSE)
plot(svm100_l, data_100)
# randomForest() from the randomForest package; with a factor response it
# performs classification automatically. mtry cannot exceed the number of
# predictors (2 here), and nodesize must be small enough to allow splits.
rf20 <- randomForest(class ~ ., data = data_20, mtry = 2, ntree = 200)
rf50 <- randomForest(class ~ ., data = data_50, mtry = 2, ntree = 200)
rf100 <- randomForest(class ~ ., data = data_100, mtry = 2, ntree = 200)
# J48() decision trees from the RWeka package
tree20 <- J48(class ~ ., data = data_20)
tree50 <- J48(class ~ ., data = data_50)
tree100 <- J48(class ~ ., data = data_100)
Comments about the classifiers’ results
Classifiers create decision boundaries to discriminate between classes. Different classifiers are able to create different shapes of decision boundaries (e.g., some are strictly linear) and thus some classifiers may perform better for certain datasets.
We can notice that we get the best results using the SVM with radial (Gaussian) kernel. This is because the radial kernel \(K(x,y) = \exp(-\|x-y\|^2/(2\sigma^2))\) can represent such nonlinear boundaries better than, for example, the polynomial kernel \(K(x,y) = (\gamma \cdot x^T y + r)^d,\ \gamma > 0\): the radial kernel is more flexible and can model many more functions in its function space. We can notice that even for the data set with 20 samples we get a circle-like boundary, and that the shape improves as more samples are added. Decision trees and random forests are also nonlinear classifiers which create nonlinear boundaries, but in this case their results are worse than those obtained using the SVM with radial kernel. For a two-class problem with reasonably clean, outlier-free and small data it is better to use SVM, because random forests and decision trees generally need a larger number of instances for their randomization to work well and generalize to novel data.
Generate a smiley face like area in first quadrant as the “true class”. Generate sample data points as in Task 5, you can make more data points. Use the SVM with radial basis function. Explain the principle of RBF use in the SVM learning.
Solution:
Using the following function we can plot a smiley face with center (5,5) and radius 2. Using the function from the earlier circle exercise we classify the points depending on whether they are inside or outside the face circle. The points inside the smiley are colored red.
rm(list = ls())
set.seed(21)
x_s <- runif(200, min = 0, max = 10)
y_s <- runif(200, min = 0, max = 10)
datas <- data.frame(x_s, y_s)
inside.circle = function(center, radius, dots) {
# Logical index: TRUE for points inside or on the circle
inside = (dots[, 1] - center[1])^2 + (dots[, 2] - center[2])^2 <= radius^2
x.coord = dots[inside, 1]
y.coord = dots[inside, 2]
coordinates = cbind(x.coord, y.coord)
# Parametric form of the face outline, for plotting
theta = seq(0, 2 * pi, length = 2000)
x = center[1] + (radius * cos(theta))
y = center[2] + (radius * sin(theta))
# Plot smiley and add points
plot(x, y, type = "l", xlim = c(0, max(dots[, 1])), ylim = c(0, max(dots[,
2])), main = "Dots classification", ylab = "Y", xlab = "X")
# Eyes and mouth: draw.circle() and draw.arc() from the plotrix package
draw.circle(5 + 1, 5 + 1, radius = 0.1, lwd = 2, border = "black")
draw.circle(5 - 1, 5 + 1, radius = 0.1, lwd = 2, border = "black")
draw.arc(5, 5, radius = 0.7, deg1 = 185, deg2 = 355, lwd = 1.2 * 3)
points(dots[, 1], dots[, 2], col = "blue")
# The points inside the circle are red.
points(x.coord, y.coord, col = "red")
return(coordinates)
}
inside.circle(c(5, 5), 2, datas)
## [1] "Scores and classification for all points (0 outside, 1 inside)"
## x_s y_s val class
## 1 7.861149 2.08714799 16.67088 0
## 2 2.524456 4.97031143 6.12920 0
## 3 6.992523 0.01621104 28.80830 0
## 4 1.844608 3.78035024 11.44405 0
## 5 9.596138 4.92502044 21.13011 0
## 6 9.186834 2.55517746 23.50674 0
Then we can run SVM with a radial kernel. The radial or Gaussian kernel \(K(x,y) = \exp(-\|x-y\|^2/(2\sigma^2))\) is used as a similarity function between feature vectors and gives a better choice of features for this classification problem. \(\|x-y\|^2\) is simply the squared Euclidean distance between two feature vectors. When \(x \approx y\) the kernel’s output is \(\approx 1\), which means the points are similar and close; when they are far from each other the output is \(\approx 0\). Each \(y\) defines a new feature, and the feature space of the kernel has an infinite number of dimensions, even for \(\sigma = 1\). If we are overfitting we can increase \(\sigma^2\), which smooths the decision boundary. It is better to use a linear kernel when we have many features and a small data set; in other cases the radial kernel is preferred. The radial kernel is used more often than the polynomial kernel because it is generally more flexible than the linear or polynomial kernels: one can model a much larger set of functions with its function space.
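The similarity behaviour is easy to see numerically. A small sketch (note that e1071 parametrizes the radial kernel as \(\exp(-\gamma\|x-y\|^2)\), i.e. \(\gamma = 1/(2\sigma^2)\) in the notation above):

```r
# The RBF kernel as a similarity function: near 1 for close points,
# near 0 for distant ones.
rbf <- function(x, y, sigma = 1) exp(-sum((x - y)^2) / (2 * sigma^2))
rbf(c(5, 5), c(5, 5))      # 1: identical points
rbf(c(5, 5), c(5.1, 5.1))  # ~0.99: very similar
rbf(c(5, 5), c(9, 9))      # ~1e-7: essentially dissimilar
```

In the SVM decision function each support vector contributes such a similarity term, which is why the learned boundary can bend around the circular and smiley-shaped regions.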
datas$class <- as.factor(datas$class)
# Avoid naming the model `svm`, which would shadow e1071's svm() function
svm_smiley <- svm(class ~ ., data = datas, kernel = "radial", cost = 10, scale = FALSE)